[Task Manager] Log at different levels based on the state#101751

Merged
chrisronline merged 24 commits into elastic:master from
chrisronline:alerting/tm_health_api_logging
Jun 16, 2021

@chrisronline chrisronline commented Jun 9, 2021

Relates to #101505

This PR introduces logic that will change how we log the monitoring stats to the Kibana server log:

.subscribe(([monitoredHealth, serviceStatus]) => {
  serviceStatus$.next(serviceStatus);
  logger.debug(`Latest Monitored Stats: ${JSON.stringify(monitoredHealth)}`);
});

Currently, we write a debug log entry whenever an event is pushed into the stream (not literally every event, since the stream is throttled). This is helpful if the user has configured verbose logging. More commonly, users do not have this configured (it produces a lot of noise), so this logging approach has limited use: a user would need to know there was a problem, restart Kibana with the config change, and then observe the metrics, assuming the problem recurs regularly.

This PR changes that by writing to a different log level based on a few things:

  • The status of each "bucket" of the API response (currently: runtime, configuration, and workload). If the status is Warning, then we log as a warning. If the status is Error, we log as an error.
  • The worst-case calculated drift (across all task types), represented by stats.runtime.value.drift.p99. If it is above a configurable threshold, we log as a warning. (The threshold defaults to 1m for no particular reason, so any insight here would be great.)

This will help ensure these metrics are written to the logs when Task Manager is underperforming, and will give valuable insight into why.
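The two rules above could be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the type names, the `pickLogLevel` function, the "OK" status value, the millisecond unit for drift, and the choice to let Error take precedence over Warning are all assumptions made for the example. Only the bucket names, the Warning/Error statuses, the stats.runtime.value.drift.p99 path, and the 1m default come from the description.

```typescript
// Hypothetical, simplified shapes; the real Kibana types are richer.
type HealthStatus = "OK" | "Warning" | "Error"; // "OK" is an assumed name
type LogLevel = "debug" | "warn" | "error";

interface BucketSketch {
  status: HealthStatus;
}

interface RuntimeBucketSketch extends BucketSketch {
  value: { drift: { p99: number } }; // assumed to be in milliseconds
}

interface MonitoredHealthSketch {
  stats: {
    runtime?: RuntimeBucketSketch;
    configuration?: BucketSketch;
    workload?: BucketSketch;
  };
}

// The description says the threshold defaults to 1m; expressed here in ms.
const DEFAULT_DRIFT_WARN_THRESHOLD_MS = 60 * 1000;

// Hypothetical helper: picks the log level for one health snapshot.
function pickLogLevel(
  health: MonitoredHealthSketch,
  driftWarnThresholdMs: number = DEFAULT_DRIFT_WARN_THRESHOLD_MS
): LogLevel {
  const { runtime, configuration, workload } = health.stats;
  const statuses = [runtime, configuration, workload]
    .filter((bucket): bucket is BucketSketch => bucket !== undefined)
    .map((bucket) => bucket.status);

  // Assumption: an Error in any bucket outranks a Warning in another.
  if (statuses.some((status) => status === "Error")) return "error";
  if (statuses.some((status) => status === "Warning")) return "warn";

  // Rule two: worst-case drift above the threshold is logged as a warning.
  const driftP99 = runtime?.value.drift.p99 ?? 0;
  if (driftP99 > driftWarnThresholdMs) return "warn";

  // Otherwise keep the existing behaviour and log at debug.
  return "debug";
}
```

With a helper like this, the subscriber shown earlier would call the logger method matching the returned level instead of always calling logger.debug.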


Labels

Feature:Task Manager, release_note:enhancement, Team:ResponseOps, v7.14.0, v8.0.0

Development

Successfully merging this pull request may close these issues.

[Alerting] [o11y] Gain insight into task manager health apis when a problem occurs